Data communication and data science

GEOG 30323

November 27, 2018

Course recap

  • Thus far: we’ve focused on exploratory data analysis, which involves data wrangling, summarization, and visualization
  • Your data analysis journey shouldn’t stop here! Topics to consider:
    • Explanatory vs. exploratory visualization
    • Statistics and data science
    • Data ethics and “big data” (next week)

Communicating with data

  • Once you’ve done all of the hard work wrangling your data, you’ll want to communicate insights to others!
  • This might include:
    • Polished data products or reports
    • Models that can scale your insights

Explanatory visualization

  • We’ve largely worked to this point with exploratory visualization, which refers to internally-facing visualizations that help us reveal insights about our data
  • Often, externally-facing data products will include explanatory visualization, which has a polished design and emphasizes one or two key points

Interactive reports

  • Example: a data journalism article - or your Jupyter Notebook!
  • Key distinction: your code, data exploration, etc. will likely be external to the report (this can vary depending on the context, however)

Tableau

  • Highly popular software for data visualization - both exploratory and explanatory
  • Intuitive, drag-and-drop interface
  • Key feature: the dashboard

Data dashboards

Demo: Tableau Public

Infographics

Obesity infographics:

Are infographics useful?

Data Science

  • Data science: new(ish) field that has emerged to address the challenges of working with modern data
  • Fuses statistics, computer science, visualization, graphic design, and the humanities/social sciences/natural sciences…

The data analysis process

Visualization vs. modeling

Hadley Wickham (paraphrased):

Visualization can surprise you, but it doesn’t scale well. Modeling scales well, but it doesn’t (fundamentally) surprise.

Statistical modeling

  • What is the mathematical relationship between an outcome variable \(Y\) and one or more other “predictor” variables \(X_{1}, \ldots, X_{n}\)?
  • Recall our use of lmplot in seaborn - lm stands for linear model

Statistical modeling

The linear model:

\[ Y = Xb + e \]

where \(Y\) represents the outcome variable, \(X\) is a matrix of predictors, \(b\) represents the “parameters”, and \(e\) represents the errors, or “residuals”

  • Linear models will not always be appropriate for modeling relationships between variables!
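
To make the notation concrete, here is a minimal sketch of estimating \(b\) by ordinary least squares with NumPy. The data are synthetic and the "true" intercept (2.0) and slope (3.0) are assumptions chosen only for illustration.

```python
import numpy as np

# Synthetic data (an assumption for illustration): y is linear in x plus noise
rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 2.0 + 3.0 * x + rng.normal(0, 1, size=100)

# X: a column of ones (for the intercept) next to the predictor
X = np.column_stack([np.ones_like(x), x])

# Least-squares estimate of the parameters b in Y = Xb + e
b, *_ = np.linalg.lstsq(X, y, rcond=None)

# e: the residuals, i.e. the part of Y that Xb does not explain
e = y - X @ b

print("Estimated parameters (intercept, slope):", b)
```

The estimates should land close to the values used to simulate the data, and because the model includes an intercept, the residuals sum to approximately zero.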

Statistics in Python

  • Substantial statistical functionality available in the statsmodels package, which installs with Anaconda

Statistics in Python

Let’s get an example ready:

Linear regression
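
A minimal sketch of a one-predictor regression using the statsmodels formula interface. The data frame and its column names (x, y) are hypothetical, simulated here so the example runs on its own.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data: y rises by about 0.8 for each one-unit increase in x
rng = np.random.default_rng(0)
df = pd.DataFrame({"x": rng.uniform(0, 10, size=200)})
df["y"] = 1.5 + 0.8 * df["x"] + rng.normal(0, 1, size=200)

# "y ~ x" reads as "model y as a function of x"; an intercept is added automatically
model = smf.ols("y ~ x", data=df).fit()

print(model.params)     # estimated intercept and slope
print(model.rsquared)   # share of the variance in y explained by x
```

Calling model.summary() prints a fuller regression table, including standard errors and p-values.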

Multiple regression
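
Adding predictors only requires extending the formula. The sketch below assumes two hypothetical predictors, x1 and x2, in simulated data.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data with two predictors that both influence y
rng = np.random.default_rng(1)
df = pd.DataFrame({
    "x1": rng.uniform(0, 10, size=300),
    "x2": rng.uniform(0, 5, size=300),
})
df["y"] = 1.0 + 2.0 * df["x1"] - 1.5 * df["x2"] + rng.normal(0, 1, size=300)

# Each coefficient is the expected change in y for a one-unit change
# in that predictor, holding the other predictor constant
model = smf.ols("y ~ x1 + x2", data=df).fit()

print(model.params)
```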

Residuals and fitted values
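
A fitted statsmodels model stores both quantities. The sketch below, again on hypothetical simulated data, shows that each observed value is the sum of its fitted value and its residual.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Hypothetical data, simulated so the example is self-contained
rng = np.random.default_rng(2)
df = pd.DataFrame({"x": rng.uniform(0, 10, size=150)})
df["y"] = 3.0 + 1.2 * df["x"] + rng.normal(0, 1, size=150)

model = smf.ols("y ~ x", data=df).fit()

# Fitted values: what the model predicts for each observation
df["fitted"] = model.fittedvalues

# Residuals: observed value minus fitted value
df["residual"] = model.resid

print(df.head())
```

Plotting the residuals against the fitted values is a common way to check whether a linear model is a reasonable choice; a visible pattern suggests the relationship is not linear.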

Machine learning

  • “The science of getting computers to act without being explicitly programmed”
  • Types of machine learning algorithms: supervised and unsupervised
  • Topics in machine learning: classification, clustering, regression

Visual introduction to machine learning: http://www.r2d3.us/visual-intro-to-machine-learning-part-1/

In Python: scikit-learn

Example: K-means clustering
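
A minimal sketch of K-means in scikit-learn. The points are synthetic: two well-separated groups generated for illustration, so the algorithm should recover them easily. Real data are rarely this tidy, and the number of clusters usually has to be chosen by the analyst.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two synthetic groups of 2-D points, centered near (0, 0) and (5, 5)
rng = np.random.default_rng(1)
group_a = rng.normal(loc=[0, 0], scale=0.5, size=(50, 2))
group_b = rng.normal(loc=[5, 5], scale=0.5, size=(50, 2))
points = np.vstack([group_a, group_b])

# K-means is unsupervised: it sees only the points, not which group made them
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

labels = kmeans.labels_             # cluster assignment for each point
centers = kmeans.cluster_centers_   # coordinates of the two cluster centers

print(centers)
```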

Example: nearest-neighbor search
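
A minimal sketch with scikit-learn's NearestNeighbors. The five points and the query location are made up for illustration.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Five hypothetical locations as (x, y) coordinates
points = np.array([
    [0.0, 0.0],
    [1.0, 0.0],
    [0.0, 2.0],
    [5.0, 5.0],
    [6.0, 5.0],
])

# Index the points, then ask for the 2 closest to a query location
nn = NearestNeighbors(n_neighbors=2).fit(points)
query = np.array([[0.9, 0.1]])
distances, indices = nn.kneighbors(query)

print(indices)     # row positions of the nearest points, closest first
print(distances)   # Euclidean distances to those points
```

Here the closest point to the query is (1.0, 0.0), followed by (0.0, 0.0).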

Making predictions
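
A minimal sketch of the usual supervised-learning workflow in scikit-learn: fit a model on one part of the data, then predict values for data it has not seen. The data are simulated, and linear regression is only one of many models that could be used here.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Simulated data: one predictor, outcome roughly 4 + 2x
rng = np.random.default_rng(3)
X = rng.uniform(0, 10, size=(200, 1))
y = 4.0 + 2.0 * X[:, 0] + rng.normal(0, 1, size=200)

# Hold out 25% of the rows to check how well predictions generalize
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

model = LinearRegression().fit(X_train, y_train)

predictions = model.predict(X_test)    # predictions for the held-out rows
score = model.score(X_test, y_test)    # R-squared on the held-out rows

# Predict the outcome for a new observation with x = 5
new_prediction = model.predict(np.array([[5.0]]))

print(score, new_prediction)
```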


How to learn more

  • Take statistics and machine learning courses here at TCU!
  • Check out DataCamp for hundreds of courses on data science in Python and R